Abstract

When I was choosing the theme of my capstone project, I was sure about two things:

I am glad that, with the help of the Data Science course and the project, I was able to reach the desired result. As I consider the connection shown below truly exciting, let me first of all visualize the connections among the different courses that helped me to accomplish my capstone project.

[Course dependency graph: Calculus → Probability → Statistics → MMA; Calculus → Optimization → MMA; MMA → ML and Data_Science; ML and Data_Science → NLP (Classification); Data_Science → Visualization; Database_Sys → Database; Sololearn → Database → Visualization; NLP (Classification) and Visualization → Capstone]

So, from the graph above we may infer that my capstone project consists of two main parts.

Both of them are accomplished using SoloLearn’s datasets.

What is SoloLearn?

SoloLearn is an Armenian startup that aims to teach coding to everyone, anywhere, from any background. It is a mobile code-learning platform that can be used by anyone who has the desire to learn coding: https://www.sololearn.com/.

Visualizing

Which data is visualized?

As SoloLearn's data is really big, the decision was made to subset its 10,000,000 users' data down to 100,000 users and do the visualizations on that smaller data set.

For visualization purposes, the regions the SoloLearn top 20 users come from were taken. Then, for each of those regions, the following data were collected: the total number of users in that region, and the number of users by their (in-app) level in that region. These were visualized using different interactive R packages.
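The aggregation described above can be sketched in base R. The SoloLearn data itself is not public, so `users` below is a hypothetical stand-in data frame with one row per user and columns `country_code` and `level`:

```r
# Hypothetical sample standing in for the subsetted SoloLearn user data
users <- data.frame(
  country_code = c("am", "am", "tr", "tr", "tr", "ke"),
  level        = c(1, 2, 1, 1, 3, 1)
)

# total number of users per region
users_total <- table(users$country_code)

# number of users at each level within each region
users_by_level <- table(users$country_code, users$level)

users_total
users_by_level
```

The resulting counts are exactly the quantities the charts need: one total per region, and one count per (region, level) pair.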

From which continents are the Top 15 Users?

library(googleVis)  # gvis* interactive charting functions

Geo <- gvisGeoChart(continents, locationvar="country_code", colorvar="users_total",
                 options=list(colors="['#aeff04', '#9eea00', '#6a9e00']",
                 title="You can see from which continents are the top users!",
                 titleTextStyle="{color:'green',fontName:'Courier',fontSize:16}",
                 bar="{groupWidth:'100%'}"))
plot(Geo)

Top 5 continents with the most users!

Pie <- gvisPieChart(sub_cont,
                      options =list(
                      is3D=TRUE,
                      pieStartAngle=300,
                      title="Number of users from top 5 continents!",
                      titleTextStyle="{color:'green',fontName:'Courier',fontSize:16}",
                      bar="{groupWidth:'100%'}"))
plot(Pie)
[Pie chart: Number of users from top 5 continents]

The underlying per-country totals (the per-level l1–l16 breakdown is omitted here):

country_code users_total
us           4,368
in           1,515
gb           551
ca           387
ru           350

Bubble Charts with user levels in each country!

In this graph the statistics of 4 regions is presented: Armenia, Turkey, Kenya and Antarctica. Yes, Antarctica: we have 3 users from Antarctica, which is something we were not expecting to find :)

Preparing data for visualizing with BubblePlot.

    library(reshape2)  # provides melt() for wide-to-long reshaping
    continents <- continents[which(
      continents$country_code=="am" |
      continents$country_code=="tr" |
      continents$country_code=="aq" |
      continents$country_code=="ke"),]
    melted_continents <- melt(continents, id.vars = "country_code",
                         measure.vars = c("l1", "l2","l3", "l4",
                                          "l5", "l6","l7", "l8",
                                          "l9", "l10","l11", "l12",
                                          "l13", "l14","l15", "l16"))
## Warning in melt_dataframe(data, as.integer(id.ind - 1),
## as.integer(measure.ind - : '.Random.seed' is not an integer
## vector but of type 'NULL', so it is ignored
    melted_continents <- melted_continents[-which(melted_continents$value==0),]
    names(melted_continents)[2] <- "levels"
    melted_continents$levels <- as.numeric(melted_continents$levels)
Bubble <- gvisBubbleChart(melted_continents, idvar="country_code", 
                          xvar="levels", yvar="value",sizevar="levels",
                          colorvar="country_code",
                           options =list(
                             colors="['#aeff04', '#9eea00', '#6a9e00']",
                      title="BubbleChart for Antarctican, Kenyan, Armenian, Turk users.",
                      titleTextStyle="{color:'green',fontName:'Courier',fontSize:16}",
                      bar="{groupWidth:'100%'}")
                        )
plot(Bubble)
[Bubble chart: BubbleChart for Antarctican, Kenyan, Armenian, Turkish users]

The underlying data (number of users at each level per country):

country_code levels value
tr           1      25
ke           1      9
am           1      1
tr           2      33
ke           2      5
am           2      14
tr           3      43
ke           3      13
am           3      13
aq           3      1
tr           4      34
ke           4      1
am           4      14
aq           4      1
tr           5      33
ke           5      3
am           5      11
aq           5      1
tr           6      15
ke           6      2
am           6      20
tr           7      2
am           7      3
ke           8      1
am           8      1
am           9      1
am           13     1

Armenian Users Statistics!

Armenian users by their levels are presented below.

column_chart <- gvisColumnChart(continents, xvar="country_code", 
                            yvar=c("l1", "l2","l3", "l4",
                                    "l5", "l6","l7", "l8",
                                    "l9", "l10","l11", "l12",
                                    "l13", "l14","l15", "l16"),
                            options=list(
                              title="Armenian Users Advancement in SoloLearn",
                              titleTextStyle="{color:'green',fontName:'Courier',fontSize:16}",
                              bar="{groupWidth:'100%'}")
                          )
plot(column_chart)
[Column chart: Armenian Users Advancement in SoloLearn]

country_code l1 l2 l3 l4 l5 l6 l7 l8 l9 l10 l11 l12 l13 l14 l15 l16
am           1  14 13 14 11 20 3  1  1  0   0   0   1   0   0   0

Summarize all the statistics via Shiny!

You can get familiar with the app by running the UI file from the package provided.

Natural Language Processing: Machine Learning: Classification

Bayes Classifier for classifying SoloLearn comments.

The nature of the problem.

SoloLearn is a mobile code-learning platform, where people from different spheres, backgrounds and cultures learn to code. According to SoloLearn users, one of the greatest features the app provides is the discussion forum. This is the place where users ask questions, share and exchange their knowledge. But as the user base is big and multicultural (as we could infer from the visualization part), sometimes the discussions on the forum do not fit into the scope of the coding content, so our moderators spend a long time filtering those discussions. As the app's data is growing exponentially, after learning about Text Classification I thought it would be great to apply it and automate this process of filtering bad (spammy) comments in the discussions.

As the moderators were spending time classifying the comments by hand, we already have a classified data set, which can be used for training our classifier.

There are different classifiers that could be chosen for this particular problem, but in the process of research the conclusion was made that, among the less compute-intensive classifiers, Naive Bayes does a really good job, so it is the one applied to the SoloLearn comment-classification problem.

Note: in the References section you can get familiar with the sources that influenced the choice of classifier.

Understanding the Naive Bayes Classifier (on a small example)!

In this part we will try to understand how Bayesian classification is performed, on a small data set taken from the SoloLearn discussions.

So, let's suppose we have the six comments given below, of which four are spam and two are non-spam (ham). Our goal is to predict whether a new (unclassified) comment is going to be spam or not. Comments:

  1. send me your what’s up number :spam
  2. send me your c++ code :ham
  3. there is an error in code, please review :ham
  4. review my website ‘https://en.wikipedia.org/wiki/Natural_language_processing’ :spam
  5. burey, send me your number, please :spam
  6. like my code :spam

The new comment we need to classify: please, send me your js code of web paint project…

Now, as we have a problem well formulated, let’s go step by step through the process of constructing the classifier.

First:
We need to calculate the prior probabilities of spam and ham.

P(spam) = number of spam comments / total number of comments = 4/6

P(ham) = number of ham comments / total number of comments = 2/6

Second:

We need to take all the individual words that have ever been seen in the comments. Then we build a basis vocabulary from those words and count how many times each particular word was encountered in spam vs. ham comments. The vocabulary constructed on the data above is given below.

spam ham word
2/4 1/2 send
2/4 0/2 number
1/4 2/2 code
1/4 1/2 please
1/4 1/2 review

Note: we are not going to put in our vocabulary words that appear fewer than 2 times in our data, or words like [me, your, is, an, etc.], as they do not convey important information and so would not have a positive effect on our classifier.

In Natural Language Processing this process is called cleaning the data. More details of this process will be covered in the cleaning part for the real SoloLearn data.
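As a quick sketch, the per-class counts in the vocabulary table above can be reproduced in base R from the six toy comments (using crude substring matching rather than the full cleaning pipeline):

```r
# the four spam and two ham toy comments
spam_comments <- c("send me your what's up number",
                   "review my website https://en.wikipedia.org/wiki/Natural_language_processing",
                   "burey, send me your number, please",
                   "like my code")
ham_comments  <- c("send me your c++ code",
                   "there is an error in code, please review")

vocab <- c("send", "number", "code", "please", "review")

# number of comments in a class that contain a given word
doc_count <- function(word, comments) sum(grepl(word, comments, fixed = TRUE))

spam_counts <- sapply(vocab, doc_count, comments = spam_comments)
ham_counts  <- sapply(vocab, doc_count, comments = ham_comments)
```

Dividing `spam_counts` by 4 and `ham_counts` by 2 gives the fractions in the table (e.g. send: 2/4 in spam, 1/2 in ham).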

Third:
Calculating the likelihood probabilities. We are going to predict the probability of our new comment, “please, send me your js code of web paint project…”, being spam vs. ham. We need to eliminate the “new” words from the comment and keep only the ones we have information about in our vocabulary. So let's see the conversion below.

please, send me your js code of web paint project… -> please send code

Now, we must convert the comment to an attribute-value representation according to the basis vocabulary words: we put 1 for each vocabulary word that appears in the comment and 0 for each that does not.

Comment's attribute-value representation: please send code -> 10110
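As a one-line sketch, this conversion can be done with base R's `%in%` against the vocabulary [send, number, code, please, review]:

```r
vocab <- c("send", "number", "code", "please", "review")

# the cleaned new comment: "please send code"
comment_words <- c("please", "send", "code")

# 1 where the vocabulary word occurs in the comment, 0 where it does not
attribute_vector <- as.integer(vocab %in% comment_words)
attribute_vector  # 1 0 1 1 0
```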

Calculating the Likelihood:

P(please send code/spam)=P(10110/spam)=(2/4)(1-2/4)(1/4)(1/4)(1-1/4)=0.012

P(please send code/ham)=P(10110/ham)=(1/2)(1-0)(1)(1/2)(1-1/2)=0.125

Finally! We are ready to apply Bayes Rule for Classification:

https://en.wikipedia.org/wiki/Naive_Bayes_classifier

Bayes Rule: posterior = prior × likelihood / evidence

Calculating Posterior probabilities:

P(spam/10110) = (0.67)(0.012) / ((0.67)(0.012) + (0.33)(0.125)) = 0.16

P(ham/10110) = (0.33)(0.125) / ((0.67)(0.012) + (0.33)(0.125)) = 0.84

So, from the posterior probabilities calculated above we conclude that the probability of the new comment “please, send me your js code of web paint project…” being spam is 0.16 and being ham is 0.84. Therefore, the new comment is more likely to be ham.

Which is a good result as we want this kind of comment to stay in SoloLearn app’s discussion forum.
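The whole toy calculation above can be reproduced in a few lines of base R; the per-word probabilities are hard-coded from the vocabulary table (order: send, number, code, please, review):

```r
prior_spam <- 4/6
prior_ham  <- 2/6

# P(word | class) from the vocabulary table
p_spam <- c(2/4, 2/4, 1/4, 1/4, 1/4)
p_ham  <- c(1/2, 0/2, 2/2, 1/2, 1/2)

x <- c(1, 0, 1, 1, 0)  # "please send code" -> 10110

# likelihood: p for present words, (1 - p) for absent ones
likelihood <- function(p, x) prod(p^x * (1 - p)^(1 - x))

evidence  <- prior_spam * likelihood(p_spam, x) + prior_ham * likelihood(p_ham, x)
post_spam <- prior_spam * likelihood(p_spam, x) / evidence
post_ham  <- prior_ham  * likelihood(p_ham,  x) / evidence

round(c(spam = post_spam, ham = post_ham), 2)  # 0.16 0.84
```

With the exact priors 4/6 and 2/6 the posteriors come out as 0.158 and 0.842, which round to the 0.16 and 0.84 computed by hand.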

Getting Familiar with Spam data

Now that we have understood the intuition behind the Naive Bayes classifier, we can use library(e1071) for constructing the classifier in R.

But first there are a couple of things we need to do. Below are the steps for the solution of our classification problem:

Cleaning the Text

  • removing numbers from the comments: the majority would likely be unique to individual senders and thus will not provide useful patterns across all comments.
  • removing filler (stop) words such as to, and, but, and or: stopwords() simply returns a vector of stop words, which can be replaced with our own preferred list.
  • removing punctuation, or replacing it with ''
  • stemming, i.e. reducing words to their root: this allows machine-learning algorithms to treat related terms as a single concept rather than attempting to learn a pattern for each variant.
  • removing additional whitespace
library(tm)        # text-mining framework: tm_map(), stopwords(), etc.
library(SnowballC) # provides the stemmer used by stemDocument

commentsData_corpus_clean <- tm_map(commentsDataCorpus,
                               content_transformer(tolower))
commentsData_corpus_clean <- tm_map(commentsData_corpus_clean,
                               removeNumbers)
commentsData_corpus_clean <- tm_map(commentsData_corpus_clean,
                               removeWords, stopwords())
commentsData_corpus_clean <- tm_map(commentsData_corpus_clean,
                               removePunctuation)
commentsData_corpus_clean <- tm_map(commentsData_corpus_clean,
                               stemDocument)
commentsData_corpus_clean <- tm_map(commentsData_corpus_clean,
                               stripWhitespace)

Splitting text documents into words

Now that the data are processed to our liking, the final step is to split the comments into individual components through a process called tokenization. A token is a single element of a text string; in this case, tokens are words. Finally, we create a Document-Term Matrix (DTM), in which rows indicate documents (comments) and columns indicate terms (words).

Note: The order of cleaning steps matters!

comments_dtm <- DocumentTermMatrix(commentsData_corpus_clean)

Creating training and test datasets

  • divide the data into two portions: training 75%, testing 25%. Since the comments are sorted in random order, we can simply take the first 4,169 for training and leave the remaining 1,390 for testing.
  • now we will get the labels from the original comments_raw data frame
  • it is suggested that the spam comments be divided evenly between the two datasets.
  comments_dtm_train <- comments_dtm[1:4169,]
  comments_dtm_test <- comments_dtm[4170:5559,]
  comments_train_labels <- comments[1:4169,]$type
  comments_test_labels <- comments[4170:5559,]$type
  prop.table(table(comments_train_labels))
## comments_train_labels
##       ham      spam 
## 0.8776685 0.1223315
  prop.table(table(comments_test_labels))
## comments_test_labels
##       ham      spam 
## 0.8856115 0.1143885

Visualizing text data - word clouds

  • We will create a word cloud from our prepared comments corpus. The cloud will be arranged in a nonrandom order, with higher-frequency words placed closer to the center.
  • The min.freq parameter specifies the number of times a word must appear in the corpus before it will be displayed in the cloud.
  • Let's do a more interesting visualization, comparing the clouds for spam and ham comments.
  • We'll use the max.words parameter to look at the 40 most common words in each of the two sets.
  library("RColorBrewer")
  library("wordcloud")
## Warning: package 'wordcloud' was built under R version 3.3.3
  wordcloud(commentsData_corpus_clean,min.freq = 50,
            random.order = FALSE,colors=brewer.pal(8, "Dark2"))

  spam <- subset(comments,type=="spam")
  ham <- subset(comments,type=="ham")
  wordcloud(spam$text,max.words = 40,
            scale = c(3,0.5),colors=brewer.pal(8, "Dark2"))

  wordcloud(ham$text,max.words = 40,
            scale = c(3,0.5),colors=brewer.pal(8, "Dark2"))

Creating indicator features for frequent words

  • Finally: transform the sparse matrix into a data structure that can be used to train a Naive Bayes classifier.
  • Currently, the sparse matrix includes over 6,500 features; that is a feature for every word that appears in at least one comment. It is unlikely that all of these are useful for classification. To reduce the number of features, we will eliminate any word that appears in fewer than five comments, i.e. in less than about 0.1 percent of the records in the training data.
  • we want all the rows, but only the columns representing the words in the frequent_terms vector
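In the tm package the frequent terms come from `findFreqTerms()`; the filtering idea itself can be sketched in base R on a small counts matrix standing in for the DTM (toy numbers, not the real data, and a threshold of 2 instead of 5):

```r
# toy document-term counts: rows = comments, columns = words
dtm <- matrix(c(1, 0, 2, 0,
                0, 1, 1, 0,
                1, 0, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("code", "number", "send", "review")))

# keep only the terms that appear in at least min_docs comments
min_docs <- 2
doc_freq <- colSums(dtm > 0)
frequent_terms <- names(doc_freq[doc_freq >= min_docs])

# all rows, but only the frequent-term columns
dtm_freq <- dtm[, frequent_terms]
frequent_terms  # "code" "send"
```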

Training a model on the data

The Naive Bayes implementation we will employ is in the e1071 package

  • We will build our model on the comments_train matrix

  • The comments_classifier object now contains a naiveBayes classifier object that can be used to make predictions.

  # the DTM cells hold word counts; this Naive Bayes setup works with
  # categorical features, so convert each count into a "Yes"/"No" indicator
  convert_counts <- function(x) {
    ifelse(x > 0, "Yes", "No")
  }
  comments_train <- apply(comments_dtm_freq_train, MARGIN = 2, convert_counts)
  comments_test <- apply(comments_dtm_freq_test, MARGIN = 2, convert_counts)
  # training the model
  library(e1071)
  comments_classifier <- naiveBayes(comments_train, as.factor(comments_train_labels))

Evaluating the model performance

To evaluate the comments classifier, we need to test its predictions on unseen comments in the test data. Recall that the unseen comment features are stored in a matrix named comments_test, while the class labels (spam or ham) are stored in a vector named comments_test_labels. The classifier that we trained has been named comments_classifier. We will use this classifier to generate predictions and then compare the predicted values to the true values.

  comments_test_pred <- predict(comments_classifier,comments_test)
  library("gmodels")
## Warning: package 'gmodels' was built under R version 3.3.3
  CrossTable(comments_test_pred,comments_test_labels,
  prop.chisq = FALSE,
  prop.t = FALSE,
  dnn = c("predicted","actual"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1390 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |       856 |        28 |       884 | 
##              |     0.968 |     0.032 |     0.636 | 
##              |     0.695 |     0.176 |           | 
## -------------|-----------|-----------|-----------|
##         spam |       375 |       131 |       506 | 
##              |     0.741 |     0.259 |     0.364 | 
##              |     0.305 |     0.824 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1231 |       159 |      1390 | 
##              |     0.886 |     0.114 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Results

From the table above we can see that 375 ham comments were misclassified as spam, and 28 spam comments were misclassified as ham. So, overall, 403 out of 1,390 comments are misclassified, and our accuracy is approximately 71%.
Note: these results may vary slightly, as a new random data set is generated on every code execution.
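The misclassification counts and the accuracy can be computed directly from the CrossTable output above:

```r
# confusion matrix copied from the CrossTable: rows = predicted, cols = actual
cm <- matrix(c(856,  28,
               375, 131),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("ham", "spam"),
                             actual    = c("ham", "spam")))

misclassified <- cm["ham", "spam"] + cm["spam", "ham"]  # 28 + 375 = 403
accuracy      <- sum(diag(cm)) / sum(cm)                # (856 + 131) / 1390

round(accuracy, 2)  # 0.71
```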

Conclusion

The accuracy reached via Naive Bayes is quite good, but I hope that by applying more advanced cleaning and language-processing techniques the accuracy will increase.

The exploration and application of those techniques on this problem is one area I am going to work on in the future. I am also planning to apply more compute-intensive classifiers, such as neural nets, to this problem.

References

AUA Data Science Course
AUA Machine Learning
Flowing Data
DiagrammeR
RGraphGallery
RVisualization
RShiny
Data Science Specialization(9 courses)
Stanford University ML
An Introduction to Statistical Learning
Statistical Learning Channel
